Background read-only region creation #1919
When the Crucible Agent is asked to create a read-only region from a remote Downstairs source, region creation currently runs inside the worker loop, so the worker thread is blocked and cannot respond to other state changes. This commit spawns region creation threads that the main worker thread can send requests to, and routes all read-only region creation requests to them. It builds on the previous work that separated the serialized on-disk types from the in-memory types: a `Creating` state is added to the in-memory type and is used while background creation is in progress.
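The hand-off described above can be sketched with standard-library threads and channels. This is a minimal illustration, not the agent's actual code: `State`, `CreateRequest`, `Regions`, `spawn_creator`, and `dispatch` are hypothetical names, the map stands in for the agent's in-memory datafile, and the "clone" is a placeholder check instead of a real copy from a remote Downstairs.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical in-memory region state; only `Creating` is named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Creating,
    Created,
    Failed,
}

/// Hypothetical request sent from the main worker to a creation thread.
pub struct CreateRequest {
    pub id: String,
    pub source: String,
}

/// Stand-in for the agent's shared in-memory region table.
pub type Regions = Arc<Mutex<HashMap<String, State>>>;

/// Spawn one creation thread and return the sending half plus its handle.
pub fn spawn_creator(regions: Regions) -> (Sender<CreateRequest>, JoinHandle<()>) {
    let (tx, rx) = channel::<CreateRequest>();
    let handle = thread::spawn(move || {
        // The loop ends once every `Sender` has been dropped.
        for req in rx {
            // Placeholder for the slow clone from the remote source.
            let next = if req.source.is_empty() {
                State::Failed
            } else {
                State::Created
            };
            regions.lock().unwrap().insert(req.id, next);
        }
    });
    (tx, handle)
}

/// Main-worker side: record `Creating`, then hand off without blocking.
/// Returns false if the creation thread is gone and the send failed.
pub fn dispatch(regions: &Regions, tx: &Sender<CreateRequest>, id: &str, source: &str) -> bool {
    regions.lock().unwrap().insert(id.to_string(), State::Creating);
    tx.send(CreateRequest {
        id: id.to_string(),
        source: source.to_string(),
    })
    .is_ok()
}

/// Dispatch two requests, wait for the thread, and return the final states.
pub fn demo() -> Vec<(String, State)> {
    let regions: Regions = Arc::new(Mutex::new(HashMap::new()));
    let (tx, handle) = spawn_creator(Arc::clone(&regions));
    dispatch(&regions, &tx, "r1", "remote:1234");
    dispatch(&regions, &tx, "r2", "");
    drop(tx); // close the channel so the creation thread exits
    handle.join().unwrap();
    let mut out: Vec<_> = regions
        .lock()
        .unwrap()
        .iter()
        .map(|(k, v)| (k.clone(), *v))
        .collect();
    out.sort_by(|a, b| a.0.cmp(&b.0));
    out
}

fn main() {
    for (id, state) in demo() {
        println!("{id}: {state:?}");
    }
}
```

Because `dispatch` only updates the table and sends on the channel, the main worker returns immediately and can keep reacting to other state changes while the creation thread does the slow work.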
leftwo
left a comment
Thanks for the work here, I have some questions for you.
| let log0 = log.new(o!("component" => "worker")); | ||
| let df0 = Arc::clone(&df); | ||
| std::thread::spawn(|| { | ||
| tokio::spawn(async { |
If we are going from a real thread to a tokio task, could a long-running region create trip us up here? The old way used a thread, which could go off and do whatever for an hour while the rest of the agent continued working. Do we run any risk of that here?
I'm not sure about all the differences between threads and tasks, but I don't think there's a risk. Whether the worker runs in a thread or in a task, the read/write region creation occurs separately from the dropshot server and datafile manipulation logic.
| ); | ||
| df.fail(&r.id); | ||
| break 'requested; | ||
| std::process::exit(1); |
What happens to the agent if we fail like this? Is it going to crash and restart?
I'm a little concerned that if we fail here, we restart the whole agent. If it's a persistent failure, do we get ourselves into a crash loop?
Looking at the error, though, it's a failure to send the request over the channel to one of our worker threads, and that should be a difficult situation to reach, correct? And if we do see it, a restart of the process is likely to behave differently? I just want to avoid setting ourselves up for a crash loop.
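For context on how hard that failure is to reach: assuming the channel in question behaves like a `std::sync::mpsc` channel (an assumption, since the channel type is not shown in the snippet above), a send only fails once the receiving end has been dropped, which normally means the receiving thread has already exited or panicked. The sketch below uses a hypothetical helper, `send_before_and_after_exit`, to show both cases.

```rust
use std::sync::mpsc::channel;
use std::thread;

/// Returns (send result while the receiver exists, send result after the
/// receiving thread has exited and dropped its receiver).
pub fn send_before_and_after_exit() -> (bool, bool) {
    let (tx, rx) = channel::<u32>();
    let worker = thread::spawn(move || {
        // Take a single request, then return, which drops `rx`.
        let _ = rx.recv();
    });

    // The receiver is still alive, so this send succeeds.
    let first = tx.send(1).is_ok();

    // Wait until the worker has exited and its receiver is gone.
    worker.join().unwrap();

    // With no receiver left, `send` returns `Err(SendError(..))`.
    let second = tx.send(2).is_ok();
    (first, second)
}

fn main() {
    let (first, second) = send_before_and_after_exit();
    println!("send while worker alive: {first}");
    println!("send after worker exit: {second}");
}
```

Under that assumption, a send error signals that a creation thread is gone rather than that a particular request is bad, which fits the expectation that a fresh process would behave differently.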